
[39] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer
Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate
Crawford. Datasheets for datasets. Communications of the ACM,
2021. 25
[40] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin,
Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a
strong data augmentation method for instance segmentation. CVPR,
2021. 16,18,22
[41] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.
Rich feature hierarchies for accurate object detection and semantic
segmentation. CVPR, 2014. 10
[42] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and
Kaiming He. Accurate, large minibatch SGD: Training ImageNet
in 1 hour. arXiv:1706.02677, 2017. 17
[43] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary
Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger,
Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na-
garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona
Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhong-
cong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car-
tillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli,
Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Chris-
tian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis,
Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Ko-
lar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li,
Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Mod-
hugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will
Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran
Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao,
Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu,
Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria
Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar,
Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude
Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi,
Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei
Yan, and Jitendra Malik. Ego4D: Around the World in 3,000 Hours
of Egocentric Video. CVPR, 2022. 20
[44] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for
large vocabulary instance segmentation. CVPR, 2019. 2,6,7,9,10,
11,19,20,21,24
[45] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple
choice learning: Learning to produce multiple structured outputs.
NeurIPS, 2012. 5,17
[46] Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage.
SOCRATES: Introducing depth in visual wildlife monitoring using
stereo vision. Sensors, 2022. 9,20
[47] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár,
and Ross Girshick. Masked autoencoders are scalable vision learn-
ers. CVPR, 2022. 5,8,12,16,17
[48] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick.
Mask R-CNN. ICCV, 2017. 10
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. CVPR, 2016. 16
[50] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units
(GELUs). arXiv:1606.08415, 2016. 16
[51] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena
Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas,
Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training
compute-optimal large language models. arXiv:2203.15556, 2022.
1
[52] Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A
semantically-segmented dataset towards visual detection of marine
debris. arXiv:2007.08097, 2020. 9,19,20
[53] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Wein-
berger. Deep networks with stochastic depth. ECCV, 2016. 17
[54] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov,
and Humphrey Shi. OneFormer: One transformer to rule universal
image segmentation. arXiv:2211.06220, 2022. 4
[55] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig.
Scaling up visual and vision-language representation learning with
noisy text supervision. ICML, 2021. 1
[56] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown,
Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey
Wu, and Dario Amodei. Scaling laws for neural language models.
arXiv:2001.08361, 2020. 1
[57] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes:
Active contour models. IJCV, 1988. 4
[58] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and
Weicheng Kuo. Learning open-world object proposals without
learning to classify. IEEE Robotics and Automation Letters, 2022.
21
[59] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother,
and Piotr Dollár. Panoptic segmentation. CVPR, 2019. 4
[60] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan
Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo
Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari.
The open images dataset v4: Unified image classification, object
detection, and visual relationship detection at scale. IJCV, 2020. 2,
6,7,18,19
[61] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and
Thomas Dandres. Quantifying the carbon emissions of machine
learning. arXiv:1910.09700, 2019. 28
[62] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Explor-
ing plain vision transformer backbones for object detection. ECCV,
2022. 5,10,11,16,21,23,24
[63] Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric
actions. CVPR, 2015. 9,20
[64] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image
segmentation with latent diversity. CVPR, 2018. 5,17,19
[65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr
Dollár. Focal loss for dense object detection. ICCV, 2017. 5,17
[66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro
Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick.
Microsoft COCO: Common objects in context. ECCV, 2014. 2,4,6,
7,11,18,19,20
[67] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. Sim-
pleClick: Interactive image segmentation with simple vision trans-
formers. arXiv:2210.11006, 2022. 8,9,12,19
[68] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regu-
larization. ICLR, 2019. 17
[69] Cathy H Lucas, Daniel OB Jones, Catherine J Hollyhead, Robert H
Condon, Carlos M Duarte, William M Graham, Kelly L Robinson,
Kylie A Pitt, Mark Schildhauer, and Jim Regetz. Gelatinous zoo-
plankton biomass in the global oceans: geographic variation and
environmental drivers. Global Ecology and Biogeography, 2014.
20
[70] Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iter-
atively trained interactive segmentation. BMVC, 2018. 4,17
[71] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc
Van Gool. Deep extreme cut: From extreme points to object seg-
mentation. CVPR, 2018. 6
[72] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik.
A database of human segmented natural images and its applica-
tion to evaluating segmentation algorithms and measuring ecologi-
cal statistics. ICCV, 2001. 10,21,28
[73] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net:
Fully convolutional neural networks for volumetric medical image
segmentation. 3DV, 2016. 5,17
[74] Massimo Minervini, Andreas Fischbach, Hanno Scharr, and
Sotirios A. Tsaftaris. Finely-grained annotated datasets for image-
based plant phenotyping. Pattern Recognition Letters, 2016. 9,20
[75] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes,
Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Debo-
rah Raji, and Timnit Gebru. Model cards for model reporting. Pro-
ceedings of the conference on fairness, accountability, and trans-
parency, 2019. 25,28